DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis
Training convolutional neural networks (CNNs) requires intense compute
throughput and high memory bandwidth. In particular, convolution layers account
for the majority of the execution time of CNN training, and GPUs are commonly
used to accelerate these layer workloads. Optimizing GPU designs for efficient
CNN training acceleration requires accurately modeling how performance improves
as compute and memory resources are scaled. We
present DeLTA, the first analytical model that accurately estimates the traffic
at each GPU memory hierarchy level, while accounting for the complex reuse
patterns of a parallel convolution algorithm. We demonstrate that our model is
both accurate and robust for different CNNs and GPU architectures. We then show
how this model can be used to carefully balance the scaling of different GPU
resources for efficient CNN performance improvement.
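The flavor of such an analytical estimate can be illustrated with a far coarser, roofline-style bound than DeLTA actually computes (DeLTA models traffic at each cache level and the reuse patterns of the parallel convolution algorithm; the sketch below only counts compulsory DRAM traffic, and every parameter value in it is hypothetical):

```python
# Illustrative roofline-style lower bound for one convolution layer.
# Much coarser than DeLTA, which models per-level cache traffic and
# data reuse; all parameters below are hypothetical.

def conv_layer_time(n, c, k, h, w, r, s, peak_flops, dram_bw, bytes_per_elem=4):
    """Lower-bound execution time (seconds) for an N x C x H x W input
    convolved with K filters of size C x R x S (stride 1, 'same' padding)."""
    flops = 2 * n * k * c * h * w * r * s          # multiply-accumulates x2
    # Compulsory DRAM traffic: read input and weights, write output once.
    traffic = bytes_per_elem * (n * c * h * w      # input activations
                                + k * c * r * s    # filter weights
                                + n * k * h * w)   # output activations
    compute_time = flops / peak_flops
    memory_time = traffic / dram_bw
    return max(compute_time, memory_time)          # compute- vs. bandwidth-bound

# Example: a ResNet-style 3x3 layer on a GPU with 15 TFLOP/s and 900 GB/s.
t = conv_layer_time(n=32, c=256, k=256, h=14, w=14, r=3, s=3,
                    peak_flops=15e12, dram_bw=900e9)
print(f"estimated lower-bound time: {t * 1e6:.1f} us")
```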
Near Data Acceleration with Concurrent Host Access
Near-data accelerators (NDAs) that are integrated with main memory have the
potential for significant power and performance benefits. Fully realizing these
benefits requires the large available memory capacity to be shared between the
host and the NDAs in a way that lets some applications access memory
directly while others are accelerated by an NDA, avoids copying data, enables
collaborative processing, and simultaneously offers high performance for both
host and NDA. We identify and solve new challenges in this context: mitigating
row-locality interference from host to NDAs, reducing read/write-turnaround
overhead caused by fine-grain interleaving of host and NDA requests,
architecting a memory layout that supports the locality required for NDAs and
sophisticated address interleaving for host performance, and supporting both
packetized and traditional memory interfaces. We demonstrate our approach in a
simulated system that consists of a multi-core CPU and NDA-enabled DDR4 memory
modules. We show that our mechanisms enable effective and efficient concurrent
access using a set of microbenchmarks, and then demonstrate the potential of
the system for the important stochastic variance-reduced gradient (SVRG)
algorithm.
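For reference, SVRG is a standard variance-reduced variant of SGD: each epoch records a full gradient at a snapshot of the weights, and the inner steps correct stochastic gradients against it. The NumPy sketch below shows only the textbook update rule on a hypothetical least-squares problem; it says nothing about how the paper maps the algorithm onto NDAs.

```python
import numpy as np

# Minimal SVRG (stochastic variance-reduced gradient) sketch on a
# least-squares objective; the problem and hyperparameters are hypothetical.

def svrg(X, y, epochs=10, inner_steps=1000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        mu = X.T @ (X @ w_snap - y) / n             # full gradient at snapshot
        for _ in range(inner_steps):
            i = rng.integers(n)
            g_i = X[i] * (X[i] @ w - y[i])          # stochastic gradient at w
            g_snap = X[i] * (X[i] @ w_snap - y[i])  # same sample at snapshot
            w -= lr * (g_i - g_snap + mu)           # variance-reduced update
    return w

# Tiny usage example on synthetic, noise-free data.
rng = np.random.default_rng(1)
X = rng.standard_normal((512, 8))
w_true = rng.standard_normal(8)
y = X @ w_true
print(np.linalg.norm(svrg(X, y) - w_true))          # should be near zero
```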
Efficient deep neural network model training by reducing memory and compute demands
Deep neural network models are commonly used in various real-life applications due to their high prediction accuracy across different tasks. In particular, CNN (convolutional neural network) models have become the de facto choice for most vision applications, such as image classification, object segmentation, and object detection. Modern CNN models contain hundreds of millions of parameters, and training them requires millions of computation- and memory-access-heavy iterations. To reduce this expensive CNN model training cost, this dissertation presents computation- and memory-cost-efficient training mechanisms that combine workload scheduling, learning algorithm, and accelerator architecture optimizations. This dissertation also introduces a performance model for data-parallel accelerators as a fast and accurate method to estimate the performance impact of the proposed architectural optimizations and to support fine-grain accelerator design space exploration.
The first part of this dissertation discusses reducing the memory bandwidth demand of CNN training. I first analyze data reuse opportunities in CNN training and show that CNN training has high data locality between network layers, but that conventional training mechanisms fail to exploit this inter-layer locality. I then develop a CNN training scheduling mechanism that modifies the network execution order to capture the inter-layer locality while sustaining high compute resource utilization. I also introduce a training accelerator with architectural optimizations that hide the additional data transfers caused by the proposed scheduling modification and realize an effective training speedup. The proposed training accelerator delivers 45 TFLOPS of mixed-precision compute and, with the memory bandwidth-efficient network training schedule, beats a state-of-the-art GPU that has ∼3X higher peak FLOPS.
The second part of this dissertation focuses on reducing the computation cost of CNN training. To reduce computation during training, I apply neural network model pruning from the beginning of training. The insight is that a fully trained CNN model contains many non-critical parameters, and pruning such parameters during training has only a minor impact on learning quality. I also prune these parameters structurally, which provides high data parallelism without complex data indexing and thus maintains high compute resource utilization (a toy sketch of structured channel pruning appears after this abstract). For a practical implementation of pruning while training, I propose three algorithmic optimizations. These optimizations remove the memory accesses caused by tensor reshaping, reduce the number of training runs needed to find the desired pruning hyper-parameters, and maintain high data parallelism even when processing a highly pruned CNN model. Overall, the proposed algorithm speeds up the training of commonly used state-of-the-art image classifiers by 39% with only 1.9% accuracy loss.
The third part of this dissertation deals with training pruned CNN models on accelerators with large systolic arrays. I first show that processing structurally pruned CNN models on a large systolic array severely underutilizes its PEs (processing elements) because the reduced number of channels decreases parallelism. I then show that naively splitting a large core into multiple small cores improves PE utilization but decreases input reuse and incurs >4% area overhead. To improve PE utilization while maintaining high input reuse, I propose a flexible systolic array architecture that can reconfigure its structure into one of several modes, each designed for efficient execution of CNN layers with different dimensions. I also develop compile-time heuristics that optimize the mapping of layer workloads to the flexible systolic array resources for both high performance and energy efficiency. The new mechanisms increase PE utilization by 36% compared to a single large-core design and improve training energy efficiency by 18% compared to many-small-core designs.
The last part of this dissertation develops an accelerator performance model for accurate CNN execution time estimation. For accurate performance modeling, I introduce a memory traffic model that predicts the data traffic at different levels of the GPU memory system hierarchy. This involves an in-depth analysis of the memory access patterns of data-parallel convolution kernels and their spatial locality. I demonstrate that the proposed performance model can provide guidance for fine-tuning GPU resources for efficient CNN performance scaling.
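The structured pruning in the dissertation's second part removes whole channels, so the surviving tensors stay dense and regular. Below is a toy sketch of that idea only; the L2-norm saliency criterion and the 50% prune ratio are hypothetical stand-ins, not the dissertation's actual algorithm:

```python
import numpy as np

# Sketch of structured (channel-wise) pruning on a conv weight tensor of
# shape (K, C, R, S). Whole output channels are dropped, so downstream
# compute stays dense; the magnitude criterion and ratio are hypothetical.

def prune_output_channels(weights, prune_ratio=0.5):
    k = weights.shape[0]
    saliency = np.linalg.norm(weights.reshape(k, -1), axis=1)  # per-channel L2 norm
    keep = np.argsort(saliency)[int(k * prune_ratio):]         # keep the strongest
    return weights[np.sort(keep)]                              # still a dense tensor

w = np.random.default_rng(0).standard_normal((64, 32, 3, 3))
w_pruned = prune_output_channels(w)
print(w.shape, "->", w_pruned.shape)   # (64, 32, 3, 3) -> (32, 32, 3, 3)
```

Because entire channels are removed, the pruned layer is just a smaller dense layer, which is what lets the approach avoid sparse indexing and keep compute utilization high.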
Reducing Activation Recomputation in Large Transformer Models
Training large transformer models is one of the most important computational
challenges of modern AI. In this paper, we show how to significantly accelerate
training of large transformer models by reducing activation recomputation.
Activation recomputation is commonly used to work around memory capacity
constraints. Rather than storing activations for backpropagation, conventional
approaches recompute them, trading redundant compute for memory savings. In
this work, we show that most of this redundant compute is unnecessary because we can
reduce memory consumption sufficiently without it. We present two novel yet
very simple techniques: sequence parallelism and selective activation
recomputation. In conjunction with tensor parallelism, these techniques almost
eliminate the need to recompute activations. We evaluate our approach on
language models up to one trillion parameters in scale and show that our method
reduces activation memory by 5x, while reducing execution time overhead from
activation recomputation by over 90%. For example, when training a 530B
parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops
Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using
recomputation. Our implementation will be available in both Megatron-LM and
NeMo-Megatron.
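The reported ~5x memory reduction can be sanity-checked against the per-layer activation-memory expressions the paper derives (quoted here from memory, so treat the constants as approximate): roughly sbh(34 + 5as/h) bytes per transformer layer, where selective recomputation drops the attention-score term 5as/h, and tensor plus sequence parallelism divide the remainder by the tensor-parallel size t. The dimensions below are hypothetical GPT-3-style values chosen for illustration:

```python
# Sanity check of the paper's per-layer activation-memory expression for a
# transformer layer: sbh(34 + 5*a*s/h) bytes, where s = sequence length,
# b = microbatch size, h = hidden size, a = attention heads, t = tensor-
# parallel size. Constants quoted from memory; dimensions are hypothetical.

def act_bytes_per_layer(s, b, h, a, t=1, seq_parallel=False, selective=False):
    linear_terms = 34 / t if seq_parallel else 10 + 24 / t
    attn_score_term = 0 if selective else 5 * a * s / (h * t)
    return s * b * h * (linear_terms + attn_score_term)

s, b, h, a, t = 2048, 1, 12288, 96, 8
base = act_bytes_per_layer(s, b, h, a, t)
best = act_bytes_per_layer(s, b, h, a, t, seq_parallel=True, selective=True)
print(f"tensor-parallel only : {base / 2**30:.2f} GiB/layer")
print(f"+ seq-par + selective: {best / 2**30:.2f} GiB/layer "
      f"({base / best:.1f}x smaller)")
```

With these illustrative dimensions the combined techniques shrink per-layer activation memory by roughly 5.4x, consistent with the abstract's 5x claim.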